Introduction

This expert system takes the books a user likes as input and recommends similar books. It uses the genres and authors of the chosen books to return the most highly rated books in the data set that share those genres or authors.

The data set contains 30 thousand books scraped from the Goodreads site. If the user can't find the desired book in the data set, they can enter its Goodreads id to fetch it. The recommendations, however, are always limited to the books in the data set.

As the picture below shows, the data set has the following columns: book id, book title, publish year, genres of the book, number of people who rated the book, number of book reviews, average rating, description of the book, and a link to the book image.

While developing the system

At first I used the Bayesian formula to compute a weighted score for each book from its average rating and number of raters, so that a book rated 5 stars by 1 user is not treated the same as a book rated 5 stars by 1000 users.

The Bayesian Formula is as follows: weighted rank (WR) = (v / (v+m)) R + (m / (v+m)) C
where:
R = average rating of the book (mean) = (Rating)
v = number of votes for the book = (Rate Count)
m = minimum votes required to be listed
C = the midpoint of the scale

I set the minimum number of votes (m) so that a book must be at least in the top 80% by rate count.
I then sorted the data by the new scores.
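
The scoring and sorting steps above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the sample rows, the value `MIN_VOTES = 100`, and the choice of `3.0` as the midpoint of a 1-to-5 star scale are all assumptions.

```python
def weighted_rating(rating, votes, min_votes, midpoint):
    """Bayesian weighted rank: WR = (v / (v + m)) * R + (m / (v + m)) * C."""
    total = votes + min_votes
    if total == 0:
        # No votes and no threshold: fall back to the midpoint.
        return midpoint
    return (votes / total) * rating + (min_votes / total) * midpoint


# Hypothetical rows: (title, average rating R, rate count v).
books = [
    ("Book A", 5.0, 1),
    ("Book B", 5.0, 1000),
    ("Book C", 4.2, 250),
]

MIN_VOTES = 100  # assumed value of m, for illustration only
MIDPOINT = 3.0   # assumed C: midpoint of a 1-to-5 star scale

# Score every book, then sort from highest to lowest weighted rank.
scored = [
    (title, weighted_rating(rating, votes, MIN_VOTES, MIDPOINT))
    for title, rating, votes in books
]
scored.sort(key=lambda item: item[1], reverse=True)
```

With these made-up numbers, the book with a single 5-star vote is pulled close to the midpoint and ends up below both books that have many raters.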

As the table above shows, the score column has been added, but the data has 2 problems.

  1. Some books in the data are duplicated.

  2. There are books in languages other than English.

We can solve the first problem easily by dropping duplicates based on the book title.
To solve the second problem I used the spaCy library to detect the language and discard everything except English, but the results were not very accurate and a lot of the English data was discarded. I decided instead to remove all books with non-ASCII characters in their titles, to at least reduce the number of non-English books.
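
A minimal sketch of the two cleaning steps, assuming each book is a dictionary with a `title` key (a hypothetical layout) and that the first occurrence of a duplicated title is the one to keep:

```python
def clean_books(books):
    """Drop books with non-ASCII titles, then drop repeated titles."""
    seen_titles = set()
    cleaned = []
    for book in books:
        title = book["title"]
        # str.isascii() is True only if every character is plain ASCII.
        if not title.isascii():
            continue
        # Keep only the first book seen for each exact title.
        if title in seen_titles:
            continue
        seen_titles.add(title)
        cleaned.append(book)
    return cleaned


# Hypothetical sample data.
sample = [
    {"title": "Dune"},
    {"title": "Dune"},
    {"title": "Cien años de soledad"},
    {"title": "Emma"},
]
cleaned_titles = [book["title"] for book in clean_books(sample)]
```

The ASCII check is a rough filter: it also drops English books whose titles contain accented characters, and it keeps non-English titles that happen to be written in plain ASCII.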

Now the data is ready and we can complete our work on it.
The idea of my system is simple:

  1. Find the top 3 genres of each book the user chose and add them to a list.

  2. Compare the genres of every book in the data set with the list and give each book points for the matching items.

  3. Repeat steps 1 and 2 for the books’ authors.

  4. Recommend the most similar and highest rated books to the user.

Below are the books chosen by the user. As we can see, each genres list now has at most 3 items.

Based on the chosen books above the lists will be as follows:

The problem with the above approach is that a genre or author repeated across the books the user read counts the same as one that appears only once, and the most similar recommendations get a similarity of at most 4.
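
The limitation can be reproduced with a small sketch of list-based scoring. The `genres` and `author` keys and the sample books are hypothetical, and the sketch assumes that a candidate's genres are also trimmed to 3, which is what gives the maximum of 3 genre points plus 1 author point.

```python
def list_similarity(candidate, chosen_books):
    """One point per candidate genre or author found in the user's lists."""
    liked_genres = []
    liked_authors = []
    for book in chosen_books:
        liked_genres.extend(book["genres"][:3])  # top 3 genres only
        liked_authors.append(book["author"])
    # Membership test only: how often a genre was liked does not matter.
    genre_points = sum(1 for g in candidate["genres"][:3] if g in liked_genres)
    author_points = 1 if candidate["author"] in liked_authors else 0
    return genre_points + author_points


# Hypothetical books chosen by the user.
chosen = [
    {"author": "X", "genres": ["fantasy", "fiction", "young adult"]},
    {"author": "X", "genres": ["fantasy", "fiction", "magic"]},
]

best_match = {"author": "X", "genres": ["fantasy", "fiction", "magic"]}
fantasy_only = {"author": "Y", "genres": ["fantasy", "romance", "history"]}
magic_only = {"author": "Y", "genres": ["magic", "romance", "history"]}

scores = [list_similarity(c, chosen) for c in (best_match, fantasy_only, magic_only)]
```

Here `fantasy` appears in both chosen books and `magic` in only one, yet the two single-genre candidates receive the same score, and the best possible match is capped at 4.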

So I decided to use a dictionary instead of the list, with each genre as a key and the number of times it is repeated in the list as its value.
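
One plausible way to build and use such a dictionary is `collections.Counter` from the standard library. The sketch below is an assumption about how the counts might feed into the score (summing the counts of the matching genres and author), not a description of the exact code; the `genres` and `author` keys and the sample books are hypothetical.

```python
from collections import Counter


def build_profile(chosen_books):
    """Count how often each genre and author occurs in the chosen books."""
    genre_counts = Counter()
    author_counts = Counter()
    for book in chosen_books:
        genre_counts.update(book["genres"][:3])  # top 3 genres only
        author_counts[book["author"]] += 1
    return genre_counts, author_counts


def weighted_similarity(candidate, genre_counts, author_counts):
    """Sum the counts of the candidate's genres and author (0 if unseen)."""
    genre_points = sum(genre_counts[g] for g in candidate["genres"][:3])
    return genre_points + author_counts[candidate["author"]]


# Hypothetical books chosen by the user.
chosen = [
    {"author": "X", "genres": ["fantasy", "fiction", "young adult"]},
    {"author": "X", "genres": ["fantasy", "fiction", "magic"]},
]
genre_counts, author_counts = build_profile(chosen)

fantasy_only = {"author": "Y", "genres": ["fantasy", "romance", "history"]}
magic_only = {"author": "Y", "genres": ["magic", "romance", "history"]}
fantasy_score = weighted_similarity(fantasy_only, genre_counts, author_counts)
magic_score = weighted_similarity(magic_only, genre_counts, author_counts)
```

A genre the user picked twice now contributes twice as much as a genre picked once, so the score is no longer capped at 4.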

Now if we look at the charts above, we notice that all the recommended books have the genres fantasy and fiction, which means that the recommendations lack diversity.
I decided to limit each genre's occurrence to at most 33% of the recommended books, which means that just 7 books can share the same genre.
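
A cap like this could be applied greedily while walking the ranked list. The sketch below is only one possible approach. It assumes 20 recommendations in total, a figure inferred from "33% means 7 books" rather than stated anywhere, and a hypothetical `genres` key on each book.

```python
import math
from collections import Counter


def diversify(ranked_books, total=20, max_share=0.33):
    """Pick books in rank order, skipping any whose genre has hit the cap."""
    cap = math.ceil(total * max_share)  # 7 when total is 20
    genre_use = Counter()
    picked = []
    for book in ranked_books:
        genres = book["genres"][:3]
        # Skip the book if any of its genres is already used `cap` times.
        if any(genre_use[g] >= cap for g in genres):
            continue
        picked.append(book)
        genre_use.update(genres)
        if len(picked) == total:
            break
    return picked


# Hypothetical ranked list, best match first.
ranked = (
    [{"title": f"F{i}", "genres": ["fantasy", "fiction"]} for i in range(10)]
    + [{"title": f"M{i}", "genres": ["mystery"]} for i in range(10)]
    + [{"title": f"H{i}", "genres": ["history"]} for i in range(10)]
)
picked = diversify(ranked)
```

With this input, only the first 7 fantasy books are kept, and the remaining slots go to the next best books from other genres.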

Examples